🎵 Machine Learning Operations (MLOps)
╔══════════════════════════════════════════════════════════════════╗
║ 🎼 TURKISH MUSIC EMOTION ║
║ Machine Learning Operations Project ║
╚══════════════════════════════════════════════════════════════════╝
👨‍🏫 Teaching Staff
👔 Lead Professors
🎓 Support Team
🎯 Project Dataset
👥 Development Team
- David Cruz Beltrán (🔧 Software Engineer)
- Javier Augusto Rebull Saucedo (⚙️ SRE / Data Engineer)
- Sandra Luz Cervantes Espinoza (🤖 ML Engineer / Data Scientist)

"Applying MLOps to emotion classification in Turkish music" 🎼✨
📋 Project Objectives
🎯 Analysis
🛠️ Development
🤖 Modeling
🔧 Tech Stack
| Category | Tools |
|---|---|
| 💻 Language | Python |
| 📊 Data Analysis | pandas, NumPy |
| 🤖 ML Framework | scikit-learn |
| 🔄 Versioning | DVC |
| 📈 Visualization | Matplotlib, Seaborn |
🎼 Emotion Classification
| Emotion | Musical Characteristics | Applications |
|---|---|---|
| 😊 Happy | Fast tempo, major key | Energetic playlists, marketing |
| 😢 Sad | Slow tempo, minor key | Music therapy, film |
| 😠 Angry | High intensity, dissonance | Content detection, video games |
| 😌 Relax | Moderate tempo, soft harmonies | Meditation, ambience |
🚀 MLOps Methodology
graph LR
A[📥 Data Ingestion] --> B[🧹 Data Cleaning]
B --> C[🔍 EDA]
C --> D[⚙️ Feature Engineering]
D --> E[🔄 Data Versioning - DVC]
E --> F[🤖 Model Training]
F --> G[🎛️ Hyperparameter Tuning]
G --> H[📊 Model Evaluation]
H --> I[📝 Documentation]
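The training-side stages of the pipeline above (scaling, dimensionality reduction, model fitting, hyperparameter search, evaluation) can be sketched with the scikit-learn tools imported later in this notebook. Everything below is illustrative: the synthetic data, the RandomForest choice, and the tiny parameter grid are assumptions, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the acoustic feature matrix (4 emotion classes)
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Chaining preprocessing and model means the grid search
# cross-validates the whole flow, not just the classifier
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=20)),
    ("clf", RandomForestClassifier(random_state=42)),
])
grid = GridSearchCV(pipe, {"clf__n_estimators": [100, 200]}, cv=3)
grid.fit(X_train, y_train)
acc = accuracy_score(y_test, grid.predict(X_test))
print(f"Test accuracy: {acc:.3f}")
```

Putting the scaler and PCA inside the `Pipeline` is what keeps the tuning step leak-free: each CV fold refits the preprocessing on its own training split.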
📈 Deliverables
| # | Component | Status | Weight |
|---|---|---|---|
| 1️⃣ | ML Canvas & Value Proposition | 🔄 In Progress | 15% |
| 2️⃣ | EDA & Data Cleaning | 🔄 In Progress | 20% |
| 3️⃣ | Feature Engineering | 🔄 In Progress | 15% |
| 4️⃣ | Data Versioning (DVC) | 🔄 In Progress | 15% |
| 5️⃣ | Model Development | 🔄 In Progress | 20% |
| 6️⃣ | Evaluation & Documentation | 🔄 In Progress | 15% |
🔗 Project Links
🎵 Turning musical emotions into knowledge through MLOps 🤖
📅 Due date: October 13, 2025 • 01:00 hrs
📧 Contact: via Canvas
Project developed as part of the Master's in Applied Artificial Intelligence
Instituto Tecnológico y de Estudios Superiores de Monterrey
Information:
- Turkish Music Emotion dataset, file "Acoustic Features.csv".
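The notebook below loads this file through `acoustic_ml.dataset.load_raw_data`. As a minimal illustration of reading such a CSV directly with pandas, here is a sketch on an inline two-row sample; the sample values and the `data/raw` path are made-up assumptions, only the "Class" column and the feature-name style come from this document.

```python
from io import StringIO

import pandas as pd

# Tiny illustrative sample (made-up values) mimicking the CSV layout:
# one "Class" label column plus numeric acoustic features
sample_csv = StringIO(
    "Class,_RMSenergy_Mean,_Tempo_Mean\n"
    "happy,0.135,123.7\n"
    "sad,0.085,101.5\n"
)
df_sample = pd.read_csv(sample_csv)
print(df_sample.shape)  # (2, 3)

# For the real project the equivalent call would be, e.g.:
# df = pd.read_csv("data/raw/Acoustic Features.csv")  # path is an assumption
```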
#########################################
#       Data Handling & Analysis        #
#########################################
import numpy as np  # Numerical computing and array handling
import pandas as pd  # Tabular data manipulation and analysis (DataFrames)
#################################
#      Data Visualization       #
#################################
import matplotlib.pyplot as plt  # Static plots and visualizations
import seaborn as sns  # Attractive statistical visualizations
###################################################
#        System Utilities and Mathematics         #
###################################################
import os  # Operating system interaction
import math  # Basic mathematical functions
from scipy import stats  # Advanced statistical functions
################################################
#       Machine Learning (Scikit-learn)        #
################################################
# --- Data Preprocessing and Transformation ---
from sklearn.preprocessing import LabelEncoder, StandardScaler  # Encode labels and standardize features
from sklearn.preprocessing import FunctionTransformer  # Apply custom functions to the data
from sklearn.preprocessing import PowerTransformer  # Apply power transformations to the data
from sklearn.decomposition import PCA  # Dimensionality reduction
from sklearn.pipeline import Pipeline  # Chain preprocessing and modeling steps
# --- Model Selection and Evaluation ---
from sklearn.model_selection import train_test_split, cross_val_score  # Split data and run cross-validation
from sklearn.model_selection import GridSearchCV  # Hyperparameter search
from sklearn.metrics import accuracy_score, confusion_matrix  # Accuracy and confusion matrix
from sklearn.metrics import classification_report  # Detailed metrics report
# --- Classification Models ---
from sklearn.linear_model import LogisticRegression  # Logistic Regression model
from sklearn.neighbors import KNeighborsClassifier  # K-Nearest Neighbors model
from sklearn.svm import SVC  # Support Vector Machine (SVM)
from sklearn.naive_bayes import GaussianNB  # Gaussian Naive Bayes classifier
from sklearn.tree import DecisionTreeClassifier  # Decision Tree model
from sklearn.ensemble import RandomForestClassifier  # Random Forest model
from sklearn.neural_network import MLPClassifier  # Neural network (Multilayer Perceptron)
#################################################
#       Project Custom Libraries 📦             #
#################################################
from acoustic_ml.dataset import load_raw_data  # Load the project's raw data
from acoustic_ml.features import create_features  # Create new features for the model
from acoustic_ml.modeling.predict import load_model, predict  # Load and use a trained model
from acoustic_ml.dataset import load_raw_data  # Import the loader from the project module
df = load_raw_data()  # Load the raw data
print(f"✓ Dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns")  # Report the dataset's dimensions
✓ Dataset loaded: 400 rows and 51 columns
📊 Data Analysis Process
🔍 Exploratory Data Analysis (EDA)
- Descriptive analysis → statistics and data summary
- Numeric variable analysis → distributions and trends
- Text variable analysis → frequencies and patterns
- Correlation analysis → bivariate and multivariate relationships
🧹 Preprocessing
- Missing values → identification and treatment
- Outliers → detection and handling
⚙️ Feature Engineering
Creating and transforming variables to improve the model
🚀 Model Training and Evaluation
Developing, validating, and optimizing the predictive model
def info_df(df):
    """
    Build a styled summary DataFrame with key information about each column.
    """
    print(f"DataFrame analysis: {df.shape[0]} rows and {df.shape[1]} columns")
    # 1. Build the summary DataFrame
    resumen = pd.DataFrame({
        'Tipo de Dato': df.dtypes,
        'Valores No Nulos': df.count(),
        'Valores Nulos': df.isnull().sum(),
        '% Nulos': round(df.isnull().sum() * 100 / len(df), 2),
        'Valores Únicos': df.nunique(),
        'Cardinalidad (%)': round(df.nunique() * 100 / len(df), 2)
    })
    # 2. Sort by percentage of nulls to surface problems first
    resumen = resumen.sort_values(by='% Nulos', ascending=False)
    # 3. Apply visual styling
    styled_resumen = (resumen.style
        .background_gradient(cmap='viridis', subset=['% Nulos'])
        .background_gradient(cmap='plasma_r', subset=['Cardinalidad (%)'])
        .format({'% Nulos': '{:.2f}%', 'Cardinalidad (%)': '{:.2f}%'})
        .bar(subset=['Valores No Nulos'], color='#5fba7d')
        .set_caption("Dataset Summary")
    )
    return styled_resumen

# Call the function on our DataFrame
info_df(df)
DataFrame analysis: 400 rows and 51 columns
|  | Tipo de Dato | Valores No Nulos | Valores Nulos | % Nulos | Valores Únicos | Cardinalidad (%) |
|---|---|---|---|---|---|---|
| Class | object | 400 | 0 | 0.00% | 4 | 1.00% |
| _Chromagram_Mean_6 | float64 | 400 | 0 | 0.00% | 241 | 60.25% |
| _Spectralspread_Mean | float64 | 400 | 0 | 0.00% | 388 | 97.00% |
| _Spectralskewness_Mean | float64 | 400 | 0 | 0.00% | 357 | 89.25% |
| _Spectralkurtosis_Mean | float64 | 400 | 0 | 0.00% | 381 | 95.25% |
| _Spectralflatness_Mean | float64 | 400 | 0 | 0.00% | 94 | 23.50% |
| _EntropyofSpectrum_Mean | float64 | 400 | 0 | 0.00% | 134 | 33.50% |
| _Chromagram_Mean_1 | float64 | 400 | 0 | 0.00% | 259 | 64.75% |
| _Chromagram_Mean_2 | float64 | 400 | 0 | 0.00% | 240 | 60.00% |
| _Chromagram_Mean_3 | float64 | 400 | 0 | 0.00% | 263 | 65.75% |
| _Chromagram_Mean_4 | float64 | 400 | 0 | 0.00% | 240 | 60.00% |
| _Chromagram_Mean_5 | float64 | 400 | 0 | 0.00% | 286 | 71.50% |
| _Chromagram_Mean_7 | float64 | 400 | 0 | 0.00% | 243 | 60.75% |
| _Brightness_Mean | float64 | 400 | 0 | 0.00% | 277 | 69.25% |
| _Chromagram_Mean_8 | float64 | 400 | 0 | 0.00% | 269 | 67.25% |
| _Chromagram_Mean_9 | float64 | 400 | 0 | 0.00% | 262 | 65.50% |
| _Chromagram_Mean_10 | float64 | 400 | 0 | 0.00% | 245 | 61.25% |
| _Chromagram_Mean_11 | float64 | 400 | 0 | 0.00% | 273 | 68.25% |
| _Chromagram_Mean_12 | float64 | 400 | 0 | 0.00% | 252 | 63.00% |
| _HarmonicChangeDetectionFunction_Mean | float64 | 400 | 0 | 0.00% | 178 | 44.50% |
| _HarmonicChangeDetectionFunction_Std | float64 | 400 | 0 | 0.00% | 159 | 39.75% |
| _HarmonicChangeDetectionFunction_Slope | float64 | 400 | 0 | 0.00% | 237 | 59.25% |
| _HarmonicChangeDetectionFunction_PeriodFreq | float64 | 400 | 0 | 0.00% | 40 | 10.00% |
| _HarmonicChangeDetectionFunction_PeriodAmp | float64 | 400 | 0 | 0.00% | 196 | 49.00% |
| _Spectralcentroid_Mean | float64 | 400 | 0 | 0.00% | 388 | 97.00% |
| _Pulseclarity_Mean | float64 | 400 | 0 | 0.00% | 266 | 66.50% |
| _RMSenergy_Mean | float64 | 400 | 0 | 0.00% | 196 | 49.00% |
| _MFCC_Mean_8 | float64 | 400 | 0 | 0.00% | 273 | 68.25% |
| _Lowenergy_Mean | float64 | 400 | 0 | 0.00% | 166 | 41.50% |
| _Fluctuation_Mean | float64 | 400 | 0 | 0.00% | 377 | 94.25% |
| _Tempo_Mean | float64 | 400 | 0 | 0.00% | 388 | 97.00% |
| _MFCC_Mean_1 | float64 | 400 | 0 | 0.00% | 354 | 88.50% |
| _MFCC_Mean_2 | float64 | 400 | 0 | 0.00% | 347 | 86.75% |
| _MFCC_Mean_3 | float64 | 400 | 0 | 0.00% | 319 | 79.75% |
| _MFCC_Mean_4 | float64 | 400 | 0 | 0.00% | 316 | 79.00% |
| _MFCC_Mean_5 | float64 | 400 | 0 | 0.00% | 297 | 74.25% |
| _MFCC_Mean_6 | float64 | 400 | 0 | 0.00% | 297 | 74.25% |
| _MFCC_Mean_7 | float64 | 400 | 0 | 0.00% | 304 | 76.00% |
| _MFCC_Mean_9 | float64 | 400 | 0 | 0.00% | 278 | 69.50% |
| _Eventdensity_Mean | float64 | 400 | 0 | 0.00% | 163 | 40.75% |
| _MFCC_Mean_10 | float64 | 400 | 0 | 0.00% | 271 | 67.75% |
| _MFCC_Mean_11 | float64 | 400 | 0 | 0.00% | 253 | 63.25% |
| _MFCC_Mean_12 | float64 | 400 | 0 | 0.00% | 272 | 68.00% |
| _MFCC_Mean_13 | float64 | 400 | 0 | 0.00% | 259 | 64.75% |
| _Roughness_Mean | float64 | 400 | 0 | 0.00% | 388 | 97.00% |
| _Roughness_Slope | float64 | 400 | 0 | 0.00% | 292 | 73.00% |
| _Zero-crossingrate_Mean | float64 | 400 | 0 | 0.00% | 388 | 97.00% |
| _AttackTime_Mean | float64 | 400 | 0 | 0.00% | 61 | 15.25% |
| _AttackTime_Slope | float64 | 400 | 0 | 0.00% | 274 | 68.50% |
| _Rolloff_Mean | float64 | 400 | 0 | 0.00% | 388 | 97.00% |
| _HarmonicChangeDetectionFunction_PeriodEntropy | float64 | 400 | 0 | 0.00% | 26 | 6.50% |
estadisticas_descriptivas = df.describe()
estadisticas_descriptivas
|  | _RMSenergy_Mean | _Lowenergy_Mean | _Fluctuation_Mean | _Tempo_Mean | _MFCC_Mean_1 | _MFCC_Mean_2 | _MFCC_Mean_3 | _MFCC_Mean_4 | _MFCC_Mean_5 | _MFCC_Mean_6 | ... | _Chromagram_Mean_9 | _Chromagram_Mean_10 | _Chromagram_Mean_11 | _Chromagram_Mean_12 | _HarmonicChangeDetectionFunction_Mean | _HarmonicChangeDetectionFunction_Std | _HarmonicChangeDetectionFunction_Slope | _HarmonicChangeDetectionFunction_PeriodFreq | _HarmonicChangeDetectionFunction_PeriodAmp | _HarmonicChangeDetectionFunction_PeriodEntropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | ... | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 | 400.000000 |
| mean | 0.134650 | 0.553605 | 7.145932 | 123.682020 | 2.456422 | 0.071890 | 0.488065 | 0.030465 | 0.178897 | 0.038307 | ... | 0.354632 | 0.590975 | 0.342340 | 0.385620 | 0.328213 | 0.192997 | -0.000157 | 1.762288 | 0.769690 | 0.966712 |
| std | 0.064368 | 0.050750 | 2.280145 | 34.234344 | 0.799262 | 0.537865 | 0.294607 | 0.275839 | 0.195230 | 0.203754 | ... | 0.334976 | 0.357981 | 0.315808 | 0.348117 | 0.055520 | 0.047092 | 0.104743 | 0.930352 | 0.072107 | 0.003841 |
| min | 0.010000 | 0.302000 | 3.580000 | 48.284000 | 0.323000 | -3.484000 | -0.870000 | -1.636000 | -0.494000 | -0.916000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.112000 | 0.060000 | -0.285000 | 0.187000 | 0.530000 | 0.939000 |
| 25% | 0.085000 | 0.523000 | 5.859500 | 101.490250 | 1.948500 | -0.262750 | 0.281250 | -0.117000 | 0.061250 | -0.078250 | ... | 0.066750 | 0.264500 | 0.059500 | 0.060750 | 0.290750 | 0.160000 | -0.058000 | 0.961000 | 0.725000 | 0.965000 |
| 50% | 0.128000 | 0.553000 | 6.734000 | 120.132500 | 2.389500 | 0.068500 | 0.464500 | 0.044500 | 0.181000 | 0.049500 | ... | 0.247000 | 0.612000 | 0.247000 | 0.296500 | 0.333000 | 0.190000 | -0.002000 | 1.682000 | 0.786000 | 0.967000 |
| 75% | 0.174000 | 0.583250 | 7.823500 | 148.986250 | 2.860250 | 0.413250 | 0.686000 | 0.198250 | 0.288500 | 0.151250 | ... | 0.612000 | 1.000000 | 0.565250 | 0.670750 | 0.367250 | 0.226000 | 0.063250 | 2.243000 | 0.824000 | 0.969000 |
| max | 0.431000 | 0.703000 | 23.475000 | 195.026000 | 5.996000 | 1.937000 | 1.622000 | 1.126000 | 1.055000 | 0.799000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.488000 | 0.340000 | 0.442000 | 4.486000 | 0.908000 | 0.977000 |
8 rows × 50 columns
def descripcion_df(df):
    """
    Build a complete, styled table of descriptive statistics
    for the numeric columns.
    """
    # Keep only numeric columns
    df_num = df.select_dtypes(include=['number'])
    if df_num.empty:
        print("No numeric columns were found in the DataFrame.")
        return
    # Basic statistics
    desc = df_num.describe().T
    # Add higher-order statistics
    desc['skew'] = df_num.skew()
    desc['kurtosis'] = df_num.kurt()
    desc['median'] = df_num.median()
    # Reorder so the median sits next to the mean
    desc = desc[['count', 'mean', 'median', 'std', 'min', '25%', '50%', '75%', 'max', 'skew', 'kurtosis']]
    print(f"Descriptive statistics for {df_num.shape[1]} numeric columns:")
    # Style to highlight notable values
    styled_desc = (desc.style
        .background_gradient(cmap='coolwarm', subset=['skew', 'kurtosis'])
        .format('{:.2f}')
        .set_caption("Extended Descriptive Statistics")
    )
    return styled_desc

# Call the function on our DataFrame
descripcion_df(df)
Descriptive statistics for 50 numeric columns:
|  | count | mean | median | std | min | 25% | 50% | 75% | max | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|---|
| _RMSenergy_Mean | 400.00 | 0.13 | 0.13 | 0.06 | 0.01 | 0.09 | 0.13 | 0.17 | 0.43 | 0.71 | 0.62 |
| _Lowenergy_Mean | 400.00 | 0.55 | 0.55 | 0.05 | 0.30 | 0.52 | 0.55 | 0.58 | 0.70 | -0.39 | 2.17 |
| _Fluctuation_Mean | 400.00 | 7.15 | 6.73 | 2.28 | 3.58 | 5.86 | 6.73 | 7.82 | 23.48 | 2.89 | 12.73 |
| _Tempo_Mean | 400.00 | 123.68 | 120.13 | 34.23 | 48.28 | 101.49 | 120.13 | 148.99 | 195.03 | 0.12 | -0.63 |
| _MFCC_Mean_1 | 400.00 | 2.46 | 2.39 | 0.80 | 0.32 | 1.95 | 2.39 | 2.86 | 6.00 | 0.87 | 1.95 |
| _MFCC_Mean_2 | 400.00 | 0.07 | 0.07 | 0.54 | -3.48 | -0.26 | 0.07 | 0.41 | 1.94 | -0.68 | 4.91 |
| _MFCC_Mean_3 | 400.00 | 0.49 | 0.46 | 0.29 | -0.87 | 0.28 | 0.46 | 0.69 | 1.62 | 0.11 | 1.19 |
| _MFCC_Mean_4 | 400.00 | 0.03 | 0.04 | 0.28 | -1.64 | -0.12 | 0.04 | 0.20 | 1.13 | -0.70 | 4.08 |
| _MFCC_Mean_5 | 400.00 | 0.18 | 0.18 | 0.20 | -0.49 | 0.06 | 0.18 | 0.29 | 1.05 | 0.08 | 1.19 |
| _MFCC_Mean_6 | 400.00 | 0.04 | 0.05 | 0.20 | -0.92 | -0.08 | 0.05 | 0.15 | 0.80 | -0.28 | 2.58 |
| _MFCC_Mean_7 | 400.00 | 0.06 | 0.07 | 0.18 | -0.94 | -0.04 | 0.07 | 0.17 | 0.57 | -0.79 | 3.35 |
| _MFCC_Mean_8 | 400.00 | 0.04 | 0.04 | 0.17 | -0.74 | -0.05 | 0.04 | 0.13 | 0.73 | -0.11 | 2.86 |
| _MFCC_Mean_9 | 400.00 | 0.02 | 0.02 | 0.16 | -0.62 | -0.07 | 0.02 | 0.12 | 0.54 | -0.13 | 1.29 |
| _MFCC_Mean_10 | 400.00 | 0.03 | 0.03 | 0.15 | -0.54 | -0.06 | 0.03 | 0.13 | 0.51 | -0.25 | 1.08 |
| _MFCC_Mean_11 | 400.00 | 0.03 | 0.04 | 0.14 | -0.49 | -0.04 | 0.04 | 0.11 | 0.49 | -0.30 | 0.91 |
| _MFCC_Mean_12 | 400.00 | 0.02 | 0.02 | 0.13 | -0.42 | -0.06 | 0.02 | 0.09 | 0.35 | -0.35 | 0.40 |
| _MFCC_Mean_13 | 400.00 | 0.02 | 0.04 | 0.13 | -0.62 | -0.05 | 0.04 | 0.10 | 0.54 | -0.49 | 2.02 |
| _Roughness_Mean | 400.00 | 527.68 | 367.58 | 521.22 | 0.94 | 169.19 | 367.58 | 734.37 | 3899.85 | 2.08 | 6.73 |
| _Roughness_Slope | 400.00 | 0.07 | 0.07 | 0.17 | -0.53 | -0.03 | 0.07 | 0.17 | 0.58 | -0.02 | 0.42 |
| _Zero-crossingrate_Mean | 400.00 | 997.25 | 893.49 | 524.90 | 149.49 | 592.27 | 893.49 | 1303.49 | 3147.91 | 0.93 | 0.69 |
| _AttackTime_Mean | 400.00 | 0.03 | 0.03 | 0.02 | 0.01 | 0.02 | 0.03 | 0.03 | 0.17 | 3.38 | 17.11 |
| _AttackTime_Slope | 400.00 | -0.00 | 0.01 | 0.15 | -0.47 | -0.09 | 0.01 | 0.09 | 0.60 | -0.01 | 0.47 |
| _Rolloff_Mean | 400.00 | 5691.07 | 5648.63 | 2293.40 | 887.15 | 3933.55 | 5648.63 | 7355.89 | 11508.30 | 0.10 | -0.61 |
| _Eventdensity_Mean | 400.00 | 2.78 | 2.77 | 1.33 | 0.23 | 1.74 | 2.77 | 3.69 | 7.95 | 0.48 | 0.13 |
| _Pulseclarity_Mean | 400.00 | 0.25 | 0.22 | 0.16 | 0.01 | 0.13 | 0.22 | 0.33 | 0.86 | 1.16 | 1.21 |
| _Brightness_Mean | 400.00 | 0.43 | 0.45 | 0.13 | 0.05 | 0.35 | 0.45 | 0.53 | 0.74 | -0.40 | -0.07 |
| _Spectralcentroid_Mean | 400.00 | 2581.17 | 2547.68 | 863.52 | 606.52 | 1981.56 | 2547.68 | 3182.57 | 5326.38 | 0.23 | -0.13 |
| _Spectralspread_Mean | 400.00 | 3082.39 | 3150.95 | 767.65 | 814.82 | 2506.77 | 3150.95 | 3684.33 | 4721.48 | -0.26 | -0.43 |
| _Spectralskewness_Mean | 400.00 | 1.87 | 1.69 | 0.88 | 0.39 | 1.33 | 1.69 | 2.18 | 7.86 | 2.28 | 9.12 |
| _Spectralkurtosis_Mean | 400.00 | 7.35 | 5.22 | 8.62 | 1.93 | 3.88 | 5.22 | 7.85 | 122.00 | 7.97 | 89.68 |
| _Spectralflatness_Mean | 400.00 | 0.05 | 0.05 | 0.03 | 0.01 | 0.03 | 0.05 | 0.06 | 0.21 | 1.49 | 5.17 |
| _EntropyofSpectrum_Mean | 400.00 | 0.87 | 0.88 | 0.04 | 0.74 | 0.85 | 0.88 | 0.90 | 0.94 | -0.98 | 0.96 |
| _Chromagram_Mean_1 | 400.00 | 0.35 | 0.27 | 0.32 | 0.00 | 0.06 | 0.27 | 0.55 | 1.00 | 0.70 | -0.72 |
| _Chromagram_Mean_2 | 400.00 | 0.25 | 0.14 | 0.29 | 0.00 | 0.02 | 0.14 | 0.40 | 1.00 | 1.21 | 0.43 |
| _Chromagram_Mean_3 | 400.00 | 0.37 | 0.29 | 0.32 | 0.00 | 0.08 | 0.29 | 0.58 | 1.00 | 0.69 | -0.73 |
| _Chromagram_Mean_4 | 400.00 | 0.21 | 0.10 | 0.25 | 0.00 | 0.02 | 0.10 | 0.32 | 1.00 | 1.49 | 1.48 |
| _Chromagram_Mean_5 | 400.00 | 0.35 | 0.27 | 0.30 | 0.00 | 0.09 | 0.27 | 0.54 | 1.00 | 0.78 | -0.46 |
| _Chromagram_Mean_6 | 400.00 | 0.26 | 0.14 | 0.29 | 0.00 | 0.02 | 0.14 | 0.45 | 1.00 | 1.11 | 0.19 |
| _Chromagram_Mean_7 | 400.00 | 0.24 | 0.14 | 0.28 | 0.00 | 0.03 | 0.14 | 0.36 | 1.00 | 1.33 | 0.93 |
| _Chromagram_Mean_8 | 400.00 | 0.39 | 0.30 | 0.33 | 0.00 | 0.10 | 0.30 | 0.64 | 1.00 | 0.57 | -0.96 |
| _Chromagram_Mean_9 | 400.00 | 0.35 | 0.25 | 0.33 | 0.00 | 0.07 | 0.25 | 0.61 | 1.00 | 0.76 | -0.77 |
| _Chromagram_Mean_10 | 400.00 | 0.59 | 0.61 | 0.36 | 0.00 | 0.26 | 0.61 | 1.00 | 1.00 | -0.22 | -1.44 |
| _Chromagram_Mean_11 | 400.00 | 0.34 | 0.25 | 0.32 | 0.00 | 0.06 | 0.25 | 0.57 | 1.00 | 0.71 | -0.69 |
| _Chromagram_Mean_12 | 400.00 | 0.39 | 0.30 | 0.35 | 0.00 | 0.06 | 0.30 | 0.67 | 1.00 | 0.55 | -1.12 |
| _HarmonicChangeDetectionFunction_Mean | 400.00 | 0.33 | 0.33 | 0.06 | 0.11 | 0.29 | 0.33 | 0.37 | 0.49 | -0.47 | 0.49 |
| _HarmonicChangeDetectionFunction_Std | 400.00 | 0.19 | 0.19 | 0.05 | 0.06 | 0.16 | 0.19 | 0.23 | 0.34 | 0.25 | -0.11 |
| _HarmonicChangeDetectionFunction_Slope | 400.00 | -0.00 | -0.00 | 0.10 | -0.28 | -0.06 | -0.00 | 0.06 | 0.44 | 0.20 | 0.93 |
| _HarmonicChangeDetectionFunction_PeriodFreq | 400.00 | 1.76 | 1.68 | 0.93 | 0.19 | 0.96 | 1.68 | 2.24 | 4.49 | 0.40 | -0.29 |
| _HarmonicChangeDetectionFunction_PeriodAmp | 400.00 | 0.77 | 0.79 | 0.07 | 0.53 | 0.72 | 0.79 | 0.82 | 0.91 | -0.76 | 0.18 |
| _HarmonicChangeDetectionFunction_PeriodEntropy | 400.00 | 0.97 | 0.97 | 0.00 | 0.94 | 0.96 | 0.97 | 0.97 | 0.98 | -1.48 | 7.57 |
def visualizar_distribuciones_numericas(df):
    """
    Draw a histogram and a box plot for every numeric column in the DataFrame,
    with three-line titles and a color palette inspired by Turkish music.
    """
    # Keep only numeric columns
    df_num = df.select_dtypes(include=['number'])
    if df_num.empty:
        print("No numeric columns to visualize.")
        return
    print("Visualizing numeric distributions:")
    # Color palette inspired by Turkish music
    colores_turcos = ["#E9A000", "#1C4E80", "#A52A2A", "#8B4513", "#4A8C80", "#D36E1B", "#663399"]
    for i, col in enumerate(df_num.columns):
        # Larger figsize to give each pair of plots more room
        fig, axes = plt.subplots(1, 2, figsize=(10.5, 3.75))
        color_base = colores_turcos[i % len(colores_turcos)]
        # --- Histogram with density line (KDE) ---
        sns.histplot(df[col], kde=True, ax=axes[0], bins=30, color=color_base, edgecolor='white')
        # Three-line title
        axes[0].set_title(f'Distribution of Variable\n"{col}"\n(Histogram and KDE)', fontsize=10)
        mean_val = df[col].mean()
        median_val = df[col].median()
        axes[0].axvline(mean_val, color='#C0C0C0', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
        axes[0].axvline(median_val, color='#FFD700', linestyle='-', linewidth=2, label=f'Median: {median_val:.2f}')
        axes[0].legend(fontsize='small')
        axes[0].set_xlabel(col, fontsize=8)
        axes[0].set_ylabel('Frequency', fontsize=8)
        # --- Box plot ---
        sns.boxplot(x=df[col], ax=axes[1], color=color_base, flierprops=dict(markerfacecolor='#B22222', marker='D'))
        # Three-line title
        axes[1].set_title(f'Dispersion Analysis\n"{col}"\n(Box Plot and Outliers)', fontsize=10)
        axes[1].set_xlabel(col, fontsize=8)
        # Extra padding so the titles have room to breathe
        plt.tight_layout(pad=2.0)
        plt.show()

# --- Call the function to draw the plots ---
visualizar_distribuciones_numericas(df)
Visualizing numeric distributions:
Unique values per variable, to identify potential categorical variables
def analizar_cardinalidad(df):
    """
    Analyze and display the cardinality (unique values) of each column
    in a visually striking way.
    """
    # Build a DataFrame with unique-value counts and percentages
    cardinalidad_df = pd.DataFrame({
        'Valores Únicos': df.nunique(),
        'Cardinalidad (%)': round(df.nunique() * 100 / len(df), 2)
    }).sort_values(by='Cardinalidad (%)', ascending=False)
    print(f"Cardinality analysis for {len(df.columns)} columns:")
    # Style for a quick visual read
    styled_df = (cardinalidad_df.style
        .background_gradient(cmap='magma', subset=['Cardinalidad (%)'])
        .format({'Cardinalidad (%)': '{:.2f}%'})
        .bar(subset=['Valores Únicos'], color='#5fba7d', align='zero')
        .set_caption("Column Uniqueness and Cardinality")
    )
    return styled_df

# --- Call the function for a cardinality-focused analysis ---
analizar_cardinalidad(df)
Cardinality analysis for 51 columns:
|  | Valores Únicos | Cardinalidad (%) |
|---|---|---|
| _Spectralcentroid_Mean | 388 | 97.00% |
| _Tempo_Mean | 388 | 97.00% |
| _Roughness_Mean | 388 | 97.00% |
| _Spectralspread_Mean | 388 | 97.00% |
| _Rolloff_Mean | 388 | 97.00% |
| _Zero-crossingrate_Mean | 388 | 97.00% |
| _Spectralkurtosis_Mean | 381 | 95.25% |
| _Fluctuation_Mean | 377 | 94.25% |
| _Spectralskewness_Mean | 357 | 89.25% |
| _MFCC_Mean_1 | 354 | 88.50% |
| _MFCC_Mean_2 | 347 | 86.75% |
| _MFCC_Mean_3 | 319 | 79.75% |
| _MFCC_Mean_4 | 316 | 79.00% |
| _MFCC_Mean_7 | 304 | 76.00% |
| _MFCC_Mean_5 | 297 | 74.25% |
| _MFCC_Mean_6 | 297 | 74.25% |
| _Roughness_Slope | 292 | 73.00% |
| _Chromagram_Mean_5 | 286 | 71.50% |
| _MFCC_Mean_9 | 278 | 69.50% |
| _Brightness_Mean | 277 | 69.25% |
| _AttackTime_Slope | 274 | 68.50% |
| _Chromagram_Mean_11 | 273 | 68.25% |
| _MFCC_Mean_8 | 273 | 68.25% |
| _MFCC_Mean_12 | 272 | 68.00% |
| _MFCC_Mean_10 | 271 | 67.75% |
| _Chromagram_Mean_8 | 269 | 67.25% |
| _Pulseclarity_Mean | 266 | 66.50% |
| _Chromagram_Mean_3 | 263 | 65.75% |
| _Chromagram_Mean_9 | 262 | 65.50% |
| _MFCC_Mean_13 | 259 | 64.75% |
| _Chromagram_Mean_1 | 259 | 64.75% |
| _MFCC_Mean_11 | 253 | 63.25% |
| _Chromagram_Mean_12 | 252 | 63.00% |
| _Chromagram_Mean_10 | 245 | 61.25% |
| _Chromagram_Mean_7 | 243 | 60.75% |
| _Chromagram_Mean_6 | 241 | 60.25% |
| _Chromagram_Mean_2 | 240 | 60.00% |
| _Chromagram_Mean_4 | 240 | 60.00% |
| _HarmonicChangeDetectionFunction_Slope | 237 | 59.25% |
| _RMSenergy_Mean | 196 | 49.00% |
| _HarmonicChangeDetectionFunction_PeriodAmp | 196 | 49.00% |
| _HarmonicChangeDetectionFunction_Mean | 178 | 44.50% |
| _Lowenergy_Mean | 166 | 41.50% |
| _Eventdensity_Mean | 163 | 40.75% |
| _HarmonicChangeDetectionFunction_Std | 159 | 39.75% |
| _EntropyofSpectrum_Mean | 134 | 33.50% |
| _Spectralflatness_Mean | 94 | 23.50% |
| _AttackTime_Mean | 61 | 15.25% |
| _HarmonicChangeDetectionFunction_PeriodFreq | 40 | 10.00% |
| _HarmonicChangeDetectionFunction_PeriodEntropy | 26 | 6.50% |
| Class | 4 | 1.00% |
Searching for missing values
def analizar_datos_faltantes(df):
    """
    Analyze and display missing data, showing only
    the columns that contain null values.
    """
    # Count and percentage of nulls per column
    valores_faltantes = df.isnull().sum()
    porcentaje_faltante = round(valores_faltantes * 100 / len(df), 2)
    # Collect the results in a DataFrame
    resumen_faltantes = pd.DataFrame({
        'Valores Faltantes': valores_faltantes,
        'Porcentaje (%)': porcentaje_faltante
    })
    # Keep only columns with missing data, sorted from worst to best
    resumen_faltantes = resumen_faltantes[resumen_faltantes['Valores Faltantes'] > 0].sort_values(
        by='Porcentaje (%)', ascending=False
    )
    # If the resulting DataFrame is empty, congratulations!
    if resumen_faltantes.empty:
        print("🎉 Excellent! No missing values were found in the DataFrame.")
        return
    print(f"Missing data analysis: {len(resumen_faltantes)} of {len(df.columns)} columns contain null values.")
    # Style to highlight the problem columns
    styled_resumen = (resumen_faltantes.style
        .background_gradient(cmap='Reds', subset=['Porcentaje (%)'])
        .format({'Porcentaje (%)': '{:.2f}%'})
        .bar(subset=['Valores Faltantes'], color='#d65f5f', align='zero')
        .set_caption("Columns with Missing Data")
    )
    return styled_resumen

# --- Call the function for a precise missing-data diagnosis ---
analizar_datos_faltantes(df)
🎉 Excellent! No missing values were found in the DataFrame.
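No missing values turned up here, but the EDA plan above also lists treatment. As a minimal sketch of what that step would look like, here is median imputation on a toy frame; the column name echoes the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap (illustrative values only)
toy = pd.DataFrame({"_Tempo_Mean": [120.0, np.nan, 150.0, 90.0]})
print(toy["_Tempo_Mean"].isnull().sum())  # 1

# Median imputation is a common default: robust to outliers,
# keeps the column's center of mass roughly in place
toy["_Tempo_Mean"] = toy["_Tempo_Mean"].fillna(toy["_Tempo_Mean"].median())
print(toy["_Tempo_Mean"].isnull().sum())  # 0
```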
Bar chart showing the frequency of each emotion
sns.countplot(x='Class', data=df)
plt.show()
Histograms for each numeric feature, to gauge how balanced the data are
hist = df.hist(bins=30, sharey=True, figsize=(16, 12))
plt.tight_layout(pad=3.0)
plt.show()
Q-Q plots to inspect the distributions after "normalizing" the data
variables_trans = df.columns.to_list()
variables_trans.remove("Class")
transformer = PowerTransformer(method="yeo-johnson", standardize=False)
transformer.fit(df[variables_trans])
PowerTransformer(method='yeo-johnson', standardize=False)
df[variables_trans] = transformer.transform(df[variables_trans])
n = len(variables_trans)
ncols = 8
nrows = math.ceil(n / ncols)
# ---------- FIGURE 1: HISTOGRAMS ----------
fig1, axes1 = plt.subplots(nrows, ncols, figsize=(ncols*2.0, nrows*1.8))  # compact plots
axes1 = axes1.flatten() if nrows*ncols > 1 else [axes1]
for i, col in enumerate(variables_trans):
    ax = axes1[i]
    data = df[col].dropna()
    ax.hist(data, bins=20, edgecolor='white')
    ax.set_title(col, fontsize=9)
    ax.tick_params(labelsize=8)
# Turn off any leftover axes
for j in range(i+1, nrows*ncols):
    axes1[j].axis('off')
fig1.suptitle('Histograms', fontsize=12, y=1.02)
fig1.tight_layout()
plt.show()
# ---------- FIGURE 2: Q-Q PLOTS ----------
fig2, axes2 = plt.subplots(nrows, ncols, figsize=(ncols*2.0, nrows*1.8))
axes2 = axes2.flatten() if nrows*ncols > 1 else [axes2]
for i, col in enumerate(variables_trans):
    ax = axes2[i]
    data = df[col].dropna()
    stats.probplot(data, dist="norm", plot=ax)
    ax.set_title(col, fontsize=9)
    ax.tick_params(labelsize=8)
for j in range(i+1, nrows*ncols):
    axes2[j].axis('off')
fig2.suptitle('Q-Q plots vs Normal', fontsize=12, y=1.02)
fig2.tight_layout()
plt.show()
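To make concrete what the Yeo-Johnson step does to a skewed feature, here is a self-contained sketch on synthetic right-skewed data; the lognormal sample is illustrative, not project data.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(400, 1))  # strongly right-skewed

# Same configuration as the notebook's transformer
pt = PowerTransformer(method="yeo-johnson", standardize=False)
x_t = pt.fit_transform(x)

print(f"skew before: {skew(x[:, 0]):.2f}, after: {skew(x_t[:, 0]):.2f}")
```

The transformed sample should land much closer to zero skewness, which is why the Q-Q plots above hug the diagonal more tightly after the transform.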
Frequency tables for each categorical feature
for column in df.select_dtypes(include=['object', 'bool']).columns:
    display(column, pd.crosstab(index=df[column], columns='% observations', normalize='columns') * 100)
'Class'
| Class | % observations |
|---|---|
| angry | 25.0 |
| happy | 25.0 |
| relax | 25.0 |
| sad | 25.0 |
label_encoder = LabelEncoder()
df["Class"] = label_encoder.fit_transform(df['Class'])
df
|  | Class | _RMSenergy_Mean | _Lowenergy_Mean | _Fluctuation_Mean | _Tempo_Mean | _MFCC_Mean_1 | _MFCC_Mean_2 | _MFCC_Mean_3 | _MFCC_Mean_4 | _MFCC_Mean_5 | ... | _Chromagram_Mean_9 | _Chromagram_Mean_10 | _Chromagram_Mean_11 | _Chromagram_Mean_12 | _HarmonicChangeDetectionFunction_Mean | _HarmonicChangeDetectionFunction_Std | _HarmonicChangeDetectionFunction_Slope | _HarmonicChangeDetectionFunction_PeriodFreq | _HarmonicChangeDetectionFunction_PeriodAmp | _HarmonicChangeDetectionFunction_PeriodEntropy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0.046255 | 1.147712 | 0.785693 | 49.459481 | 1.883949 | 0.377859 | 0.865352 | 0.079494 | 0.219944 | ... | 0.269816 | 1.136109 | 0.007925 | 0.091497 | 0.537622 | 0.200511 | 0.017918 | 0.807322 | 6.762849 | 5.938439e+30 |
| 1 | 2 | 0.095623 | 0.727046 | 0.764910 | 52.943619 | 1.900475 | 0.545123 | 0.767637 | 0.433917 | 0.549911 | ... | 0.001995 | 1.136109 | -0.000000 | 0.487793 | 0.460935 | 0.169693 | -0.083691 | 1.931502 | 12.210196 | 5.010738e+30 |
| 2 | 2 | 0.041457 | 1.302833 | 0.793466 | 65.444342 | 1.512479 | 0.986259 | 0.494373 | 0.354595 | 0.285253 | ... | 0.147740 | 0.824692 | 0.015702 | 0.491675 | 0.821232 | 0.222127 | 0.129709 | 1.179722 | 11.586369 | 3.993609e+30 |
| 3 | 2 | 0.101274 | 1.185439 | 0.792824 | 29.467297 | 1.534865 | 1.775386 | 0.600987 | 0.380042 | 0.010997 | ... | 0.036190 | 1.136109 | 0.134976 | 0.424863 | 0.851187 | 0.202856 | 0.041560 | 0.319833 | 15.087478 | 5.302789e+30 |
| 4 | 2 | 0.056968 | 1.147712 | 0.789386 | 37.016335 | 1.656768 | 0.234033 | 0.795458 | 0.098256 | 0.430168 | ... | 0.003979 | 0.428451 | 0.447120 | 0.000999 | 0.615226 | 0.200511 | 0.087067 | 0.617198 | 10.533691 | 2.839124e+30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 0 | 0.121233 | 1.107620 | 0.744632 | 58.198474 | 1.582646 | 0.065509 | 0.703238 | 0.046522 | 0.263501 | ... | 0.248283 | 0.935896 | 0.275238 | 0.110760 | 0.555823 | 0.120580 | 0.116536 | 1.658183 | 27.746435 | 5.611702e+30 |
| 396 | 0 | 0.122175 | 0.878151 | 0.740478 | 63.229758 | 1.517594 | -0.145470 | 0.338298 | -0.010970 | 0.028981 | ... | 0.019488 | 1.136109 | 0.359050 | 0.009898 | 0.345856 | 0.110819 | 0.140000 | 1.931502 | 29.365195 | 5.010738e+30 |
| 397 | 0 | 0.127222 | 1.044541 | 0.733938 | 50.607703 | 1.081312 | 0.600708 | 0.858658 | -0.109992 | 0.242721 | ... | 0.048665 | 0.189290 | 0.213246 | 0.091497 | 0.423771 | 0.132968 | 0.108024 | 1.931502 | 22.030962 | 3.773116e+30 |
| 398 | 0 | 0.104013 | 1.092413 | 0.728113 | 44.628036 | 1.221377 | -0.205021 | 0.680127 | 0.090942 | 0.205078 | ... | 0.115906 | 1.136109 | 0.222443 | 0.122376 | 0.442152 | 0.123533 | 0.060080 | 1.931502 | 21.187036 | 5.611702e+30 |
| 399 | 0 | 0.071178 | 0.817475 | 0.746010 | 55.607331 | 1.318279 | -0.013976 | 0.814626 | -0.020891 | 0.342518 | ... | 0.087529 | 1.136109 | 0.084520 | 0.031917 | 0.271693 | 0.097658 | 0.006988 | 0.541146 | 25.344224 | 4.473616e+30 |
400 rows × 51 columns
df.corr()
| Class | _RMSenergy_Mean | _Lowenergy_Mean | _Fluctuation_Mean | _Tempo_Mean | _MFCC_Mean_1 | _MFCC_Mean_2 | _MFCC_Mean_3 | _MFCC_Mean_4 | _MFCC_Mean_5 | ... | _Chromagram_Mean_9 | _Chromagram_Mean_10 | _Chromagram_Mean_11 | _Chromagram_Mean_12 | _HarmonicChangeDetectionFunction_Mean | _HarmonicChangeDetectionFunction_Std | _HarmonicChangeDetectionFunction_Slope | _HarmonicChangeDetectionFunction_PeriodFreq | _HarmonicChangeDetectionFunction_PeriodAmp | _HarmonicChangeDetectionFunction_PeriodEntropy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Class | 1.000000 | -0.300955 | 0.174327 | 0.392882 | -0.056205 | 0.362134 | 0.068330 | -0.118221 | -0.050932 | -0.045488 | ... | 0.028964 | -0.111679 | -0.054256 | 0.044204 | 0.314104 | 0.669975 | 0.019715 | -0.201242 | -0.597808 | -0.063368 |
| _RMSenergy_Mean | -0.300955 | 1.000000 | -0.281635 | -0.177114 | -0.014712 | -0.171867 | -0.025694 | 0.075546 | 0.004651 | -0.049314 | ... | 0.128949 | 0.131721 | 0.163869 | 0.029282 | -0.014093 | -0.369574 | -0.091747 | 0.120925 | 0.355068 | -0.032966 |
| _Lowenergy_Mean | 0.174327 | -0.281635 | 1.000000 | 0.155663 | -0.042172 | 0.118139 | 0.133686 | -0.067116 | 0.056634 | 0.018319 | ... | 0.074131 | -0.025039 | 0.066540 | -0.034081 | 0.023246 | 0.207735 | 0.223859 | -0.088219 | -0.196982 | -0.021062 |
| _Fluctuation_Mean | 0.392882 | -0.177114 | 0.155663 | 1.000000 | -0.106790 | 0.124968 | 0.147555 | -0.130878 | 0.120289 | -0.055065 | ... | -0.028560 | -0.083241 | 0.001803 | 0.060133 | 0.338422 | 0.452092 | 0.104434 | -0.069151 | -0.306285 | -0.040611 |
| _Tempo_Mean | -0.056205 | -0.014712 | -0.042172 | -0.106790 | 1.000000 | -0.075108 | 0.083978 | 0.017875 | 0.025673 | 0.072101 | ... | 0.060861 | 0.059757 | 0.035827 | 0.015775 | -0.089613 | -0.137069 | -0.058962 | 0.032283 | 0.059248 | 0.116209 |
| _MFCC_Mean_1 | 0.362134 | -0.171867 | 0.118139 | 0.124968 | -0.075108 | 1.000000 | 0.040707 | 0.067129 | 0.049199 | -0.107084 | ... | -0.177787 | -0.094558 | -0.177474 | -0.060446 | -0.004390 | 0.398240 | 0.020684 | -0.126185 | -0.431880 | -0.072597 |
| _MFCC_Mean_2 | 0.068330 | -0.025694 | 0.133686 | 0.147555 | 0.083978 | 0.040707 | 1.000000 | 0.047537 | 0.358970 | 0.177046 | ... | 0.003020 | 0.009304 | -0.117024 | -0.025309 | -0.041602 | 0.136278 | 0.147273 | -0.052555 | -0.165532 | 0.033975 |
| _MFCC_Mean_3 | -0.118221 | 0.075546 | -0.067116 | -0.130878 | 0.017875 | 0.067129 | 0.047537 | 1.000000 | 0.190250 | 0.104222 | ... | -0.079648 | 0.009654 | -0.058884 | -0.023069 | -0.157477 | -0.079437 | 0.076963 | -0.040040 | -0.026490 | -0.058232 |
| _MFCC_Mean_4 | -0.050932 | 0.004651 | 0.056634 | 0.120289 | 0.025673 | 0.049199 | 0.358970 | 0.190250 | 1.000000 | 0.272867 | ... | -0.047560 | 0.062361 | -0.090089 | -0.025413 | -0.075269 | -0.004211 | 0.127470 | -0.075101 | -0.008205 | -0.092063 |
| _MFCC_Mean_5 | -0.045488 | -0.049314 | 0.018319 | -0.055065 | 0.072101 | -0.107084 | 0.177046 | 0.104222 | 0.272867 | 1.000000 | ... | 0.057164 | 0.005268 | 0.023767 | 0.030171 | -0.085922 | -0.104762 | 0.037293 | 0.027463 | 0.062554 | 0.013751 |
| _MFCC_Mean_6 | -0.037669 | -0.038272 | -0.006840 | -0.112516 | 0.064635 | 0.063804 | 0.181029 | 0.121178 | 0.337789 | 0.379467 | ... | -0.018752 | 0.068534 | -0.040221 | -0.080736 | -0.119552 | -0.078788 | -0.011430 | -0.051585 | 0.040636 | -0.067341 |
| _MFCC_Mean_7 | -0.044928 | -0.073890 | -0.007664 | -0.204872 | 0.109782 | -0.033569 | 0.057125 | 0.097992 | 0.055655 | 0.105031 | ... | 0.065463 | 0.078696 | -0.025926 | -0.092893 | -0.142082 | -0.126186 | 0.014191 | -0.080202 | 0.030478 | 0.041139 |
| _MFCC_Mean_8 | 0.029137 | -0.028576 | 0.002668 | -0.009350 | 0.084302 | 0.023452 | 0.006055 | 0.048145 | -0.048309 | -0.006843 | ... | 0.105767 | 0.045640 | -0.016686 | -0.089824 | -0.054623 | -0.065312 | -0.082687 | -0.004595 | 0.004550 | 0.004134 |
| _MFCC_Mean_9 | 0.032591 | -0.061882 | 0.139184 | -0.019740 | 0.089347 | 0.018528 | 0.133188 | -0.023552 | 0.040513 | -0.043807 | ... | 0.011716 | -0.005635 | 0.017750 | -0.119475 | -0.086143 | 0.017109 | 0.091577 | -0.070629 | -0.074413 | -0.028158 |
| _MFCC_Mean_10 | -0.020621 | -0.024011 | 0.061075 | 0.092923 | 0.089506 | -0.045888 | 0.054751 | 0.006540 | 0.005815 | 0.044985 | ... | 0.000884 | -0.037011 | 0.001084 | -0.096384 | -0.044083 | 0.041982 | 0.036067 | -0.047205 | -0.047720 | 0.012345 |
| _MFCC_Mean_11 | -0.075638 | 0.028033 | -0.062946 | -0.045076 | 0.066297 | -0.160400 | 0.114932 | -0.119775 | -0.011914 | 0.098805 | ... | 0.014056 | 0.030343 | 0.056202 | 0.065739 | 0.003975 | -0.022370 | -0.059868 | -0.009596 | 0.085333 | 0.069530 |
| _MFCC_Mean_12 | 0.028299 | -0.040536 | 0.060556 | 0.047515 | 0.022166 | 0.016254 | 0.005449 | 0.003575 | -0.034378 | -0.072253 | ... | -0.052725 | -0.008822 | 0.024351 | 0.098350 | 0.098797 | 0.082483 | 0.013647 | -0.019403 | -0.011682 | 0.079298 |
| _MFCC_Mean_13 | -0.067013 | -0.067257 | 0.103279 | -0.047603 | 0.068089 | 0.050632 | 0.133568 | 0.031047 | -0.002487 | 0.058814 | ... | -0.086168 | 0.039328 | 0.008362 | 0.132387 | -0.026787 | 0.036666 | -0.010579 | 0.050582 | 0.005560 | 0.100646 |
| _Roughness_Mean | -0.394106 | 0.922060 | -0.354121 | -0.247770 | 0.022865 | -0.276713 | -0.090417 | 0.080286 | -0.045444 | -0.019835 | ... | 0.127504 | 0.154507 | 0.183550 | 0.057615 | 0.003768 | -0.459690 | -0.097954 | 0.120594 | 0.473365 | 0.000949 |
| _Roughness_Slope | 0.144049 | -0.101701 | 0.035621 | 0.050208 | 0.036830 | 0.084562 | 0.104972 | -0.026281 | 0.062458 | -0.005208 | ... | -0.047285 | 0.024319 | -0.088668 | 0.103516 | 0.109817 | 0.220078 | 0.086695 | -0.086702 | -0.164752 | 0.029399 |
| _Zero-crossingrate_Mean | -0.074970 | 0.026139 | -0.015567 | 0.069289 | 0.004459 | -0.682747 | -0.288590 | -0.310513 | -0.263036 | -0.112658 | ... | 0.205742 | 0.040806 | 0.255343 | 0.179463 | 0.235178 | -0.174276 | -0.073436 | 0.086946 | 0.331018 | 0.064745 |
| _AttackTime_Mean | 0.480993 | -0.444078 | 0.459084 | 0.410047 | -0.092952 | 0.192544 | 0.053319 | -0.122228 | -0.092701 | -0.158953 | ... | -0.051039 | -0.101806 | -0.091793 | -0.059036 | 0.079447 | 0.500102 | 0.153832 | -0.122243 | -0.504521 | -0.095227 |
| _AttackTime_Slope | -0.110661 | 0.080719 | -0.040061 | -0.041544 | -0.160343 | 0.003024 | -0.204739 | -0.010913 | -0.025485 | 0.057418 | ... | 0.043832 | -0.060792 | 0.096816 | 0.071297 | 0.020882 | -0.094326 | -0.090617 | 0.089473 | 0.112861 | 0.003476 |
| _Rolloff_Mean | -0.156709 | 0.164842 | 0.040105 | 0.110152 | 0.073996 | -0.620981 | 0.278365 | -0.251943 | 0.125691 | 0.008531 | ... | 0.199541 | 0.136537 | 0.173243 | 0.109965 | 0.100547 | -0.173244 | -0.002296 | 0.070895 | 0.257103 | 0.038293 |
| _Eventdensity_Mean | -0.484784 | 0.450370 | -0.490128 | -0.275265 | 0.101036 | -0.448766 | -0.205635 | -0.048297 | -0.063492 | 0.014219 | ... | 0.168344 | 0.100157 | 0.200335 | 0.065604 | 0.052560 | -0.630868 | -0.107198 | 0.262463 | 0.681950 | 0.128439 |
| _Pulseclarity_Mean | -0.372102 | 0.189068 | 0.011201 | 0.033147 | 0.117209 | -0.309492 | -0.066375 | -0.173008 | -0.042196 | -0.046745 | ... | 0.176810 | 0.057012 | 0.181230 | 0.070518 | 0.030180 | -0.392775 | 0.032244 | 0.156389 | 0.467441 | 0.151976 |
| _Brightness_Mean | -0.224204 | 0.118511 | -0.069372 | 0.003776 | 0.045190 | -0.896142 | -0.188497 | -0.167434 | -0.135427 | -0.035968 | ... | 0.193288 | 0.080318 | 0.211502 | 0.126747 | 0.127376 | -0.279805 | -0.044780 | 0.109403 | 0.377307 | 0.051659 |
| _Spectralcentroid_Mean | -0.167885 | 0.154831 | 0.008884 | 0.088264 | 0.065080 | -0.739102 | 0.141171 | -0.287106 | 0.028881 | -0.023646 | ... | 0.208623 | 0.116180 | 0.191573 | 0.125356 | 0.118459 | -0.206763 | -0.011408 | 0.085939 | 0.304879 | 0.038277 |
| _Spectralspread_Mean | -0.122575 | 0.197366 | 0.054832 | 0.098471 | 0.067792 | -0.437120 | 0.373816 | -0.221671 | 0.192448 | -0.017684 | ... | 0.173048 | 0.151266 | 0.135917 | 0.102115 | 0.063392 | -0.137951 | 0.018140 | 0.041517 | 0.201810 | 0.008683 |
| _Spectralskewness_Mean | 0.201855 | -0.144788 | 0.012160 | -0.083325 | -0.090589 | 0.785463 | -0.227533 | 0.233059 | -0.099461 | -0.055843 | ... | -0.221493 | -0.108266 | -0.161126 | -0.090343 | -0.099755 | 0.217282 | -0.009631 | -0.084267 | -0.304221 | -0.062772 |
| _Spectralkurtosis_Mean | 0.178580 | -0.160859 | -0.013604 | -0.120152 | -0.092898 | 0.702285 | -0.303576 | 0.250561 | -0.159243 | -0.040663 | ... | -0.215957 | -0.121465 | -0.155683 | -0.099912 | -0.105316 | 0.180356 | -0.011200 | -0.066685 | -0.272795 | -0.054307 |
| _Spectralflatness_Mean | -0.058126 | 0.063555 | 0.089920 | 0.043057 | 0.048616 | -0.304261 | 0.299231 | -0.193467 | 0.085180 | -0.059916 | ... | 0.084293 | 0.077582 | 0.068592 | 0.053399 | -0.026120 | -0.089562 | 0.026035 | 0.044578 | 0.086967 | -0.000055 |
| _EntropyofSpectrum_Mean | -0.219075 | 0.170018 | -0.042593 | 0.033299 | 0.057302 | -0.774152 | -0.081432 | -0.295714 | -0.096215 | -0.043524 | ... | 0.243881 | 0.105560 | 0.265868 | 0.175020 | 0.173405 | -0.307949 | -0.052169 | 0.113522 | 0.433486 | 0.044032 |
| _Chromagram_Mean_1 | -0.032565 | 0.035289 | -0.002613 | 0.094381 | 0.060964 | -0.064125 | -0.058609 | -0.134867 | -0.054657 | 0.084882 | ... | 0.093008 | -0.178847 | 0.353609 | 0.349807 | 0.257996 | -0.038977 | -0.115735 | 0.051425 | 0.221190 | 0.001559 |
| _Chromagram_Mean_2 | -0.033956 | 0.071248 | -0.042269 | 0.011150 | 0.031261 | -0.121695 | -0.140485 | -0.100036 | -0.130305 | 0.025397 | ... | 0.444471 | -0.090557 | 0.256742 | 0.316175 | 0.310582 | -0.019188 | -0.099637 | -0.013200 | 0.238698 | 0.060834 |
| _Chromagram_Mean_3 | -0.117814 | 0.096670 | -0.035443 | -0.010909 | 0.011636 | -0.010620 | -0.050730 | -0.074067 | -0.012424 | 0.020332 | ... | -0.073335 | 0.108335 | 0.165910 | -0.094055 | 0.070674 | -0.084338 | -0.111410 | 0.014367 | 0.136154 | 0.037803 |
| _Chromagram_Mean_4 | -0.035852 | 0.092103 | -0.004291 | 0.014159 | -0.049349 | -0.092652 | -0.087213 | -0.116014 | -0.129973 | -0.025885 | ... | 0.409477 | -0.195227 | 0.430896 | 0.156868 | 0.257533 | -0.059484 | -0.090352 | -0.013615 | 0.232225 | -0.029288 |
| _Chromagram_Mean_5 | 0.031225 | 0.019026 | -0.101486 | 0.038371 | 0.052184 | -0.022024 | -0.099362 | -0.070167 | -0.092000 | -0.078961 | ... | -0.005299 | 0.103886 | -0.125068 | 0.468105 | 0.117932 | 0.022755 | -0.031876 | 0.016427 | 0.057214 | -0.005967 |
| _Chromagram_Mean_6 | 0.069666 | -0.017463 | 0.059919 | 0.060916 | 0.007653 | -0.064415 | -0.146435 | -0.045886 | -0.133151 | 0.014098 | ... | 0.228630 | -0.082557 | 0.397089 | -0.091050 | 0.245944 | -0.004566 | 0.011835 | 0.074252 | 0.126667 | -0.078459 |
| _Chromagram_Mean_7 | 0.252600 | -0.084324 | 0.034529 | 0.207115 | -0.022048 | -0.015167 | -0.041778 | -0.086944 | -0.090960 | -0.028213 | ... | 0.267587 | -0.134059 | 0.047911 | 0.492224 | 0.359056 | 0.174767 | -0.033643 | -0.033668 | -0.002329 | -0.018215 |
| _Chromagram_Mean_8 | 0.136248 | 0.008533 | 0.031312 | 0.169993 | 0.015751 | -0.067079 | 0.059728 | -0.048600 | 0.035155 | -0.019915 | ... | 0.295011 | -0.065163 | 0.278067 | -0.002241 | 0.314400 | 0.058605 | -0.058629 | 0.077626 | 0.097986 | -0.042300 |
| _Chromagram_Mean_9 | 0.028964 | 0.128949 | 0.074131 | -0.028560 | 0.060861 | -0.177787 | 0.003020 | -0.079648 | -0.047560 | 0.057164 | ... | 1.000000 | 0.062536 | 0.368510 | 0.136012 | 0.285409 | -0.104406 | -0.051001 | 0.068966 | 0.254790 | 0.048431 |
| _Chromagram_Mean_10 | -0.111679 | 0.131721 | -0.025039 | -0.083241 | 0.059757 | -0.094558 | 0.009304 | 0.009654 | 0.062361 | 0.005268 | ... | 0.062536 | 1.000000 | 0.064108 | -0.046829 | -0.120855 | -0.144962 | 0.102380 | 0.000520 | 0.077126 | 0.007735 |
| _Chromagram_Mean_11 | -0.054256 | 0.163869 | 0.066540 | 0.001803 | 0.035827 | -0.177474 | -0.117024 | -0.058884 | -0.090089 | 0.023767 | ... | 0.368510 | 0.064108 | 1.000000 | 0.074993 | 0.227201 | -0.161698 | -0.091722 | 0.008532 | 0.303412 | -0.013759 |
| _Chromagram_Mean_12 | 0.044204 | 0.029282 | -0.034081 | 0.060133 | 0.015775 | -0.060446 | -0.025309 | -0.023069 | -0.025413 | 0.030171 | ... | 0.136012 | -0.046829 | 0.074993 | 1.000000 | 0.190464 | 0.047696 | -0.050764 | -0.001428 | 0.090829 | -0.022490 |
| _HarmonicChangeDetectionFunction_Mean | 0.314104 | -0.014093 | 0.023246 | 0.338422 | -0.089613 | -0.004390 | -0.041602 | -0.157477 | -0.075269 | -0.085922 | ... | 0.285409 | -0.120855 | 0.227201 | 0.190464 | 1.000000 | 0.427895 | -0.138121 | 0.046846 | 0.122117 | 0.075975 |
| _HarmonicChangeDetectionFunction_Std | 0.669975 | -0.369574 | 0.207735 | 0.452092 | -0.137069 | 0.398240 | 0.136278 | -0.079437 | -0.004211 | -0.104762 | ... | -0.104406 | -0.144962 | -0.161698 | 0.047696 | 0.427895 | 1.000000 | 0.027965 | -0.260338 | -0.752416 | -0.130781 |
| _HarmonicChangeDetectionFunction_Slope | 0.019715 | -0.091747 | 0.223859 | 0.104434 | -0.058962 | 0.020684 | 0.147273 | 0.076963 | 0.127470 | 0.037293 | ... | -0.051001 | 0.102380 | -0.091722 | -0.050764 | -0.138121 | 0.027965 | 1.000000 | -0.056598 | -0.105679 | -0.154811 |
| _HarmonicChangeDetectionFunction_PeriodFreq | -0.201242 | 0.120925 | -0.088219 | -0.069151 | 0.032283 | -0.126185 | -0.052555 | -0.040040 | -0.075101 | 0.027463 | ... | 0.068966 | 0.000520 | 0.008532 | -0.001428 | 0.046846 | -0.260338 | -0.056598 | 1.000000 | 0.363481 | 0.077017 |
| _HarmonicChangeDetectionFunction_PeriodAmp | -0.597808 | 0.355068 | -0.196982 | -0.306285 | 0.059248 | -0.431880 | -0.165532 | -0.026490 | -0.008205 | 0.062554 | ... | 0.254790 | 0.077126 | 0.303412 | 0.090829 | 0.122117 | -0.752416 | -0.105679 | 0.363481 | 1.000000 | 0.074684 |
| _HarmonicChangeDetectionFunction_PeriodEntropy | -0.063368 | -0.032966 | -0.021062 | -0.040611 | 0.116209 | -0.072597 | 0.033975 | -0.058232 | -0.092063 | 0.013751 | ... | 0.048431 | 0.007735 | -0.013759 | -0.022490 | 0.075975 | -0.130781 | -0.154811 | 0.077017 | 0.074684 | 1.000000 |
51 rows × 51 columns
Heatmap quantifying the correlation between the numeric variables in the DataFrame
data_df_numeric = df.select_dtypes(include=[np.number])
data_df_corr = data_df_numeric.corr()
plt.figure(figsize=(18, 12))
# With 51 features, annot=True is hard to read; consider annot=False or a mask
sns.heatmap(data_df_corr, annot=True, linewidths=0.5)
plt.show()
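A numeric complement to the heatmap is to rank features by absolute correlation with the target. This is a sketch (the helper name `top_target_correlations` is ours), assuming `df` holds a numeric `Class` column as above:

```python
import numpy as np
import pandas as pd

def top_target_correlations(df: pd.DataFrame, target: str = "Class", n: int = 5) -> pd.Series:
    """Rank numeric features by |Pearson r| against the target column."""
    corr = df.select_dtypes(include=[np.number]).corr()[target].drop(target)
    # Sort by magnitude but keep the signed values for interpretation
    return corr.reindex(corr.abs().sort_values(ascending=False).index).head(n)
```

On this dataset that would surface columns such as `_HarmonicChangeDetectionFunction_Std` (r ≈ 0.67) and `_HarmonicChangeDetectionFunction_PeriodAmp` (r ≈ −0.60) at the top.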
Feature engineering¶
Mean, median, and standard deviation by class
if 'Class' in df.columns:
    numeric_cols = df.select_dtypes(include=[float, int]).columns
    numeric_cols = numeric_cols.drop('Class', errors='ignore')
    for col in numeric_cols[:5]:
        plt.figure(figsize=(6, 3))
        sns.boxplot(x='Class', y=col, data=df)
        plt.title(f"Boxplot of {col} by class")
        plt.tight_layout()
        plt.show()
else:
    print("Column 'Class' not found in the DataFrame.")
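The per-class summary statistics announced above can be computed directly with `groupby`; a minimal sketch (the helper name is ours):

```python
import pandas as pd

def class_summary(df: pd.DataFrame, cols, target: str = "Class") -> pd.DataFrame:
    """Mean, median, and standard deviation of each feature, per class."""
    return df.groupby(target)[list(cols)].agg(["mean", "median", "std"])
```

The result is a MultiIndex-columned table with one row per emotion class, which pairs naturally with the boxplots above.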
Density plots (KDE)
for col in numeric_cols[:5]:
    plt.figure(figsize=(5, 3))
    sns.kdeplot(data=df, x=col, hue="Class", fill=True)
    plt.title(f"Density of {col}")
    plt.tight_layout()
    plt.show()
Outlier detection¶
Z-score: flag values with z > 3 or z < −3:
z_scores = np.abs(stats.zscore(df[numeric_cols], nan_policy='omit'))
outlier_mask = (z_scores > 3)
outlier_counts = outlier_mask.sum(axis=0)
outlier_counts_series = pd.Series(outlier_counts, index=numeric_cols)
outlier_counts_series.sort_values(ascending=False).head(10)
_MFCC_Mean_6                                      8
_MFCC_Mean_8                                      6
_MFCC_Mean_10                                     5
_Lowenergy_Mean                                   4
_MFCC_Mean_2                                      4
_HarmonicChangeDetectionFunction_PeriodEntropy    4
_MFCC_Mean_4                                      3
_MFCC_Mean_5                                      3
_MFCC_Mean_3                                      3
_MFCC_Mean_11                                     3
dtype: int64
IQR (Interquartile Range)
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
is_outlier = ((df[numeric_cols] < (Q1 - 1.5 * IQR)) |
(df[numeric_cols] > (Q3 + 1.5 * IQR)))
# count of outliers per column
is_outlier.sum().sort_values(ascending=False).head(10)
_MFCC_Mean_6                                      17
_Fluctuation_Mean                                 16
_HarmonicChangeDetectionFunction_PeriodEntropy    15
_MFCC_Mean_13                                     14
_MFCC_Mean_10                                     13
_MFCC_Mean_1                                      13
_AttackTime_Mean                                  13
_MFCC_Mean_8                                      12
_MFCC_Mean_9                                      11
_MFCC_Mean_5                                      11
dtype: int64
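Both methods only count outliers; neither removes anything yet. One common treatment, shown here as a sketch and not as what this notebook applies, is IQR capping (winsorizing) instead of dropping rows, which preserves the sample size:

```python
import pandas as pd

def iqr_clip(df: pd.DataFrame, cols, k: float = 1.5) -> pd.DataFrame:
    """Cap each column at [Q1 - k*IQR, Q3 + k*IQR] rather than dropping rows."""
    out = df.copy()
    q1 = out[cols].quantile(0.25)
    q3 = out[cols].quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column bounds with the DataFrame's columns
    out[cols] = out[cols].clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)
    return out
```

Whether capping is appropriate depends on whether the extreme values are measurement artifacts or genuine musical extremes; for MFCCs that judgment should be made per feature.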
Data cleaning and treatment¶
Verify that the Class column contains only the expected labels and that the numeric columns are free of inf, NaN, or other artifacts
print(df["Class"].unique())
[2 1 3 0]
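The labels check out; the inf/NaN half of the verification can be sketched like this (the helper name is ours):

```python
import numpy as np
import pandas as pd

def artifact_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count NaN and infinite entries per numeric column."""
    num = df.select_dtypes(include=[np.number])
    return pd.DataFrame({"n_nan": num.isna().sum(), "n_inf": np.isinf(num).sum()})
```

A clean dataset should show zeros in both columns; any nonzero row pinpoints exactly which feature needs imputation or filtering before modeling.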
Check for duplicates
dup_count = df.duplicated().sum()
print("Duplicados:", dup_count)
if dup_count > 0:
    df = df.drop_duplicates()
Duplicados: 12
Drop constant columns (zero standard deviation)
stds = df[numeric_cols].std()
zero_std = stds[stds == 0].index.tolist()
print("Columnas constantes:", zero_std)
df = df.drop(columns=zero_std)
Columnas constantes: []
Version the cleaned dataset
# Note: to_csv returns None, so its result should not be assigned. df is also
# already deduplicated at this point; a true raw snapshot belongs before cleaning.
df.to_csv("turkish_music_emotion_raw.csv", index=False)
df.to_csv("turkish_music_emotion_cleaned.csv", index=False)
X = df.loc[:,df.columns != "Class"]
Y = df.loc[:,df.columns == "Class"]
X
| _RMSenergy_Mean | _Lowenergy_Mean | _Fluctuation_Mean | _Tempo_Mean | _MFCC_Mean_1 | _MFCC_Mean_2 | _MFCC_Mean_3 | _MFCC_Mean_4 | _MFCC_Mean_5 | _MFCC_Mean_6 | ... | _Chromagram_Mean_9 | _Chromagram_Mean_10 | _Chromagram_Mean_11 | _Chromagram_Mean_12 | _HarmonicChangeDetectionFunction_Mean | _HarmonicChangeDetectionFunction_Std | _HarmonicChangeDetectionFunction_Slope | _HarmonicChangeDetectionFunction_PeriodFreq | _HarmonicChangeDetectionFunction_PeriodAmp | _HarmonicChangeDetectionFunction_PeriodEntropy | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.046255 | 1.147712 | 0.785693 | 49.459481 | 1.883949 | 0.377859 | 0.865352 | 0.079494 | 0.219944 | 0.120032 | ... | 0.269816 | 1.136109 | 0.007925 | 0.091497 | 0.537622 | 0.200511 | 0.017918 | 0.807322 | 6.762849 | 5.938439e+30 |
| 1 | 0.095623 | 0.727046 | 0.764910 | 52.943619 | 1.900475 | 0.545123 | 0.767637 | 0.433917 | 0.549911 | 0.881112 | ... | 0.001995 | 1.136109 | -0.000000 | 0.487793 | 0.460935 | 0.169693 | -0.083691 | 1.931502 | 12.210196 | 5.010738e+30 |
| 2 | 0.041457 | 1.302833 | 0.793466 | 65.444342 | 1.512479 | 0.986259 | 0.494373 | 0.354595 | 0.285253 | 0.142846 | ... | 0.147740 | 0.824692 | 0.015702 | 0.491675 | 0.821232 | 0.222127 | 0.129709 | 1.179722 | 11.586369 | 3.993609e+30 |
| 3 | 0.101274 | 1.185439 | 0.792824 | 29.467297 | 1.534865 | 1.775386 | 0.600987 | 0.380042 | 0.010997 | 0.145968 | ... | 0.036190 | 1.136109 | 0.134976 | 0.424863 | 0.851187 | 0.202856 | 0.041560 | 0.319833 | 15.087478 | 5.302789e+30 |
| 4 | 0.056968 | 1.147712 | 0.789386 | 37.016335 | 1.656768 | 0.234033 | 0.795458 | 0.098256 | 0.430168 | 0.296447 | ... | 0.003979 | 0.428451 | 0.447120 | 0.000999 | 0.615226 | 0.200511 | 0.087067 | 0.617198 | 10.533691 | 2.839124e+30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 0.121233 | 1.107620 | 0.744632 | 58.198474 | 1.582646 | 0.065509 | 0.703238 | 0.046522 | 0.263501 | 0.105583 | ... | 0.248283 | 0.935896 | 0.275238 | 0.110760 | 0.555823 | 0.120580 | 0.116536 | 1.658183 | 27.746435 | 5.611702e+30 |
| 396 | 0.122175 | 0.878151 | 0.740478 | 63.229758 | 1.517594 | -0.145470 | 0.338298 | -0.010970 | 0.028981 | 0.039226 | ... | 0.019488 | 1.136109 | 0.359050 | 0.009898 | 0.345856 | 0.110819 | 0.140000 | 1.931502 | 29.365195 | 5.010738e+30 |
| 397 | 0.127222 | 1.044541 | 0.733938 | 50.607703 | 1.081312 | 0.600708 | 0.858658 | -0.109992 | 0.242721 | 0.220548 | ... | 0.048665 | 0.189290 | 0.213246 | 0.091497 | 0.423771 | 0.132968 | 0.108024 | 1.931502 | 22.030962 | 3.773116e+30 |
| 398 | 0.104013 | 1.092413 | 0.728113 | 44.628036 | 1.221377 | -0.205021 | 0.680127 | 0.090942 | 0.205078 | 0.062568 | ... | 0.115906 | 1.136109 | 0.222443 | 0.122376 | 0.442152 | 0.123533 | 0.060080 | 1.931502 | 21.187036 | 5.611702e+30 |
| 399 | 0.071178 | 0.817475 | 0.746010 | 55.607331 | 1.318279 | -0.013976 | 0.814626 | -0.020891 | 0.342518 | 0.276010 | ... | 0.087529 | 1.136109 | 0.084520 | 0.031917 | 0.271693 | 0.097658 | 0.006988 | 0.541146 | 25.344224 | 4.473616e+30 |
388 rows × 50 columns
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3,random_state=42)
standardscaler = StandardScaler()
x_train = standardscaler.fit_transform(X_train)
x_test = standardscaler.transform(X_test)
Logistic Regression
logreg = LogisticRegression(random_state=0)
logreg.fit(x_train, Y_train.values.ravel())
y_pred = logreg.predict(x_test)
print(y_pred)
print(Y_test)
[3 2 1 2 0 1 1 3 1 1 0 2 1 3 3 1 2 3 3 3 0 0 2 1 0 3 0 0 2 0 3 2 3 0 0 2 3
0 3 1 1 1 2 1 2 3 3 0 3 3 0 2 3 2 3 1 3 2 2 2 1 3 0 3 0 2 2 0 0 1 0 2 1 2
2 2 1 0 2 3 1 1 0 3 1 3 3 1 3 0 2 0 2 3 0 2 2 0 2 2 3 0 0 2 1 3 0 0 2 3 0
1 3 1 1 3 0]
Class
279 3
46 2
172 1
42 2
359 0
.. ...
252 3
216 3
113 1
17 2
301 0
[117 rows x 1 columns]
accuracy = accuracy_score(Y_test,y_pred)
print(accuracy)
0.7777777777777778
cm = confusion_matrix(Y_test,y_pred)
cm
array([[26, 0, 1, 2],
[ 2, 23, 1, 2],
[ 1, 0, 25, 11],
[ 1, 2, 3, 17]])
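Accuracy hides the per-class asymmetry visible in the confusion matrix (e.g. 11 of class 2's samples are predicted as class 3). `classification_report` exposes precision, recall, and F1 per class; a sketch, with the wrapper name being ours:

```python
from sklearn.metrics import classification_report

def per_class_report(y_true, y_pred) -> dict:
    """Precision, recall, and F1 per class, as a nested dict keyed by label."""
    return classification_report(y_true, y_pred, output_dict=True, zero_division=0)
```

Passing `Y_test` and `y_pred` from the cell above would quantify exactly which emotions the model confuses.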
KNN Classification
model = KNeighborsClassifier()
model.fit(x_train, Y_train.values.ravel())
y_pred = model.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.6495726495726496
cm = confusion_matrix(Y_test,y_pred)
cm
array([[26, 2, 0, 1],
[ 1, 27, 0, 0],
[ 7, 4, 15, 11],
[ 6, 4, 5, 8]])
model = KNeighborsClassifier(metric="manhattan")
model.fit(x_train, Y_train.values.ravel())
y_pred = model.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.6581196581196581
cm = confusion_matrix(Y_test,y_pred)
cm
array([[26, 1, 0, 2],
[ 1, 27, 0, 0],
[ 6, 2, 14, 15],
[ 6, 3, 4, 10]])
Support Vector Machine
svc = SVC(kernel='linear')
svc.fit(x_train,Y_train.values.ravel())
y_pred = svc.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.7264957264957265
cm = confusion_matrix(Y_test,y_pred)
cm
array([[28, 0, 1, 0],
[ 0, 24, 1, 3],
[ 4, 0, 20, 13],
[ 3, 3, 4, 13]])
svc = SVC(kernel='rbf')
svc.fit(x_train,Y_train.values.ravel())
y_pred = svc.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.7521367521367521
cm = confusion_matrix(Y_test,y_pred)
cm
array([[23, 1, 0, 5],
[ 0, 25, 1, 2],
[ 0, 0, 25, 12],
[ 1, 2, 5, 15]])
Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(x_train,Y_train.values.ravel())
y_pred = gnb.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.7863247863247863
cm = confusion_matrix(Y_test,y_pred)
cm
array([[23, 2, 1, 3],
[ 0, 25, 0, 3],
[ 0, 1, 29, 7],
[ 0, 3, 5, 15]])
Decision Tree
dtc = DecisionTreeClassifier(criterion="entropy")
dtc.fit(x_train,Y_train.values.ravel())
y_pred = dtc.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.6752136752136753
cm = confusion_matrix(Y_test,y_pred)
cm
array([[24, 1, 0, 4],
[ 2, 22, 2, 2],
[ 2, 3, 21, 11],
[ 3, 4, 4, 12]])
Random Forest
rfc = RandomForestClassifier()
rfc.fit(x_train,Y_train.values.ravel())
y_pred = rfc.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.8205128205128205
cm = confusion_matrix(Y_test,y_pred)
cm
array([[26, 0, 0, 3],
[ 1, 26, 0, 1],
[ 0, 1, 26, 10],
[ 1, 2, 2, 18]])
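The fitted forest also exposes impurity-based feature importances, which feed naturally into the feature-engineering deliverable; a sketch (the helper name is ours):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def top_importances(model: RandomForestClassifier, feature_names, n: int = 10) -> pd.Series:
    """Rank features by the fitted forest's impurity-based importance."""
    imp = pd.Series(model.feature_importances_, index=list(feature_names))
    return imp.sort_values(ascending=False).head(n)
```

Applied as `top_importances(rfc, X.columns)`, this would indicate which audio descriptors drive the 0.82 accuracy; note that impurity importances are biased toward high-cardinality features.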
Dimensionality Reduction
# Note: PCA here runs on the unscaled features, so the ~1e30-magnitude
# _HarmonicChangeDetectionFunction_PeriodEntropy column dominates the first
# component; standardizing before PCA avoids this.
pca = PCA(n_components=8)
pca.fit(X)
x_pca = pca.transform(X)
transformed = pd.DataFrame(x_pca)
X = transformed
X
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 9.003031e+29 | 46774.809364 | -926.896269 | -439.107779 | -37.528514 | -7.861490 | -7.855499 | -0.930623 |
| 1 | -2.739867e+28 | -15396.849296 | -2013.061404 | -762.800698 | -55.460648 | 2.776469 | -4.564301 | 5.838971 |
| 2 | -1.044527e+30 | -49533.852416 | -3042.806624 | -370.528999 | 33.938369 | 13.407332 | -5.777505 | 16.406478 |
| 3 | 2.646531e+29 | 24525.715202 | -4031.572854 | -763.691484 | 40.897985 | 5.702465 | 8.370228 | -16.578039 |
| 4 | -2.199013e+30 | -150150.116064 | -1882.520731 | -598.871764 | -56.325918 | 11.225291 | 2.046949 | -7.212007 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 383 | 5.735657e+29 | 41906.372599 | -2568.793272 | -679.189570 | -31.280710 | 2.245369 | 5.644715 | 13.338374 |
| 384 | -2.739867e+28 | 18278.013476 | -2670.222907 | -310.077079 | 64.282401 | -18.758019 | 1.193700 | 16.190012 |
| 385 | -1.265020e+30 | -77168.512531 | -545.411112 | -79.530364 | -8.676784 | 14.326846 | 2.173908 | 5.732842 |
| 386 | 5.735657e+29 | 42590.101548 | -1501.956360 | -439.647642 | -44.886751 | 7.903570 | 4.653567 | -1.812959 |
| 387 | -5.645205e+29 | -42199.305661 | -1706.781605 | -477.377457 | -11.777035 | 5.836857 | 6.503257 | 12.737869 |
388 rows × 8 columns
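The enormous first component above is an artifact of fitting PCA on unscaled features: PCA maximizes variance, so the single ~1e30-scale column absorbs nearly all of it. A sketch of the standard remedy, standardizing before the projection (the helper name is ours):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def scaled_pca(X, n_components: int = 8):
    """Standardize first so no large-magnitude column dominates the components."""
    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components)
    X_red = pca.fit_transform(X_std)
    return X_red, pca.explained_variance_ratio_
```

With scaling in place, `explained_variance_ratio_` becomes a meaningful guide for choosing the number of components, and the poor KNN accuracy observed below would likely improve.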
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3,random_state=42)
x_train = standardscaler.fit_transform(X_train)
x_test = standardscaler.transform(X_test)
classifier = KNeighborsClassifier()
classifier.fit(x_train, Y_train.values.ravel())
KNeighborsClassifier()
y_pred = classifier.predict(x_test)
print(accuracy_score(Y_test,y_pred))
0.48717948717948717
Hyperparameter Optimization
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(solver='lbfgs', max_iter=1000))
])
param_grid = {'classifier__C': [0.001, 0.01, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 10, 100]}
grid_search = GridSearchCV(pipeline, param_grid, cv=10, scoring='accuracy')
# The pipeline already standardizes internally, so fit on the raw training
# features; scaling them a second time beforehand is redundant.
grid_search.fit(X_train, Y_train.values.ravel())
# Get the best hyperparameters and accuracy
best_C = grid_search.best_params_['classifier__C']
best_accuracy = grid_search.best_score_
print("Best C:", best_C)
print("Best Accuracy:", best_accuracy)
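The best cross-validation accuracy is an optimistic estimate; the refitted best pipeline should also be scored on the held-out split. A self-contained sketch of the same pattern (the helper and its defaults are ours):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def tune_and_eval(X, y, cs=(0.01, 0.1, 1.0, 10.0), cv=5, seed=42):
    """Grid-search C inside a scaling pipeline, then score on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    pipe = Pipeline([("scaler", StandardScaler()),
                     ("classifier", LogisticRegression(max_iter=1000))])
    search = GridSearchCV(pipe, {"classifier__C": list(cs)}, cv=cv, scoring="accuracy")
    search.fit(X_tr, y_tr)  # raw features; the pipeline scales internally
    # best_estimator_ is refit on all of X_tr with the winning C
    return search.best_params_, search.best_estimator_.score(X_te, y_te)
```

Equivalently, in the notebook one would call `grid_search.best_estimator_.score(X_test, Y_test.values.ravel())` after the search above.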
Model Evaluation


